Indraneel's blog

Feeble attempts at grokking the incomprehensible.

Patching (live or otherwise) and versioning a service

Most services, when they are small, can afford to take downtime, and in some cases that may be the best thing to do. Eventually, though, the service matures and your customers find downtime for upgrades unacceptable. Live patching is the practice of using specific strategies to keep the service up while changing the underlying code. When patching a service, you introduce a before and an after version of the service, and it makes sense to make that explicit. Today, we visit the concepts to keep in mind when working on a patching and versioning strategy.

Strategy 0 - Take downtime

While it sounds simple, there are quite a few things needed to get this right. You need to decide on a date in advance, estimate the amount of downtime and post notice on your site for the users. A week's notice is a good rule of thumb. Select a time when you have the resources to handle a patch well and optionally ensure that you have the least number of customers online. If you have customers in multiple time zones, be mindful of that and at minimum state the time window in UTC. In some cases, you may want to keep such a canned message handy, especially if the message needs localizing.

On the day/time of the downtime, divert the traffic to a status page which explains what is happening and when your site will be back up. It is essential that you have attempted dry runs of the patch and have a good sense of how long it would take to finish. It is also essential to have a team who can handle the patching and a backup team to take over if the patch schedule slips for some reason. In this form of patching, you can apply whatever techniques provide the fastest patching strategy, including rebooting services/servers/databases etc. Most people will stop any background tasks (cron jobs), update the data and the services, restart everything and run some sanity tests and then redirect the traffic to the site.
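The sequence above can be sketched as a small orchestration script. This is a minimal sketch with hypothetical step names; a real runbook would shell out to your actual deploy tooling at each step.

```python
# Minimal sketch of the downtime runbook described above. Each step name is
# a hypothetical placeholder; real code would invoke your deploy tooling.

def run_downtime_patch(log):
    steps = [
        "divert_traffic_to_status_page",
        "stop_background_tasks",          # e.g. disable cron jobs
        "apply_data_and_service_updates",
        "restart_everything",
        "run_sanity_tests",
        "redirect_traffic_back_to_site",
    ]
    for step in steps:
        log.append(step)                  # real code: execute the step here
    return log

timeline = run_downtime_patch([])
```

The useful property to enforce in such a script is ordering: traffic is diverted before anything is touched, and only restored after sanity tests pass.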

In the case of a standalone service, there is no need to version it, since at any given point in time there is only 1 version of the service running. You may still need to version the service, however, if you and your callers have agreed to do so.

Strategy 1 - Partition the service and then patch

Going back to the canonical environment, the patch would likely proceed as follows:

  1. One set of the front end and middle tier boxes is taken offline. All of your network traffic now goes to the remaining active front end boxes, and all work gets delegated to the remaining middle tier boxes
  2. Shut down the web services and middle tier services on the boxes that were just taken offline. Confirm that the remaining boxes stay online and have taken over all work responsibilities
  3. Change your SQL code using specific patterns that create new DB entities while leaving the old entities around
  4. Replace the code in the front end and middle tier boxes using your preferred installer. Keep the services shut down
  5. Run any configuration steps on the machines as necessary
  6. Restart the services that had been shut down on the front end and middle tier boxes and confirm that they have warmed up with data as appropriate
  7. Bring the front end and middle tier boxes back online
  8. Take the other set of front end and middle tier boxes offline and apply steps 1-7 in sequence. You can choose to skip the SQL patching in step 3 this time as an optimization
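The steps above can be sketched in code. Everything here is a simulation with hypothetical names (`Partition`, `patch_sql`); the point is the ordering: drain a partition, patch SQL once with additive changes, patch the code, restore, then repeat for the other partition.

```python
# Sketch of the partitioned rolling patch (steps 1-8 above).

class Partition:
    """One set of front end + middle tier boxes."""
    def __init__(self, name):
        self.name = name
        self.online = True
        self.version = 1

    def take_offline(self):
        self.online = False     # steps 1-2: drain traffic, stop services

    def patch(self, version):
        self.version = version  # steps 4-5: install new code, configure

    def bring_online(self):
        self.online = True      # steps 6-7: restart, warm up, re-admit traffic


def rolling_patch(partitions, new_version, patch_sql):
    sql_patched = False
    for p in partitions:
        p.take_offline()
        if not sql_patched:
            patch_sql()         # step 3: additive DB changes, old entities kept
            sql_patched = True  # step 8: skip the SQL work on later passes
        p.patch(new_version)
        p.bring_online()


parts = [Partition("A"), Partition("B")]
sql_calls = []
rolling_patch(parts, 2, lambda: sql_calls.append("v2-schema"))
```

Note that the SQL patch runs exactly once and must be additive, because partition B is still serving traffic against the old code while A is being patched.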

What that means

In this case, we partitioned the service 2 ways. You can partition it more ways if you have the hardware for it and the software to coordinate it correctly. The strategy also relies on there being a clean separation of responsibilities between the code and the data layers. With respect to versioning, the service must handle the fact that, after step 7 above, there may be 2 versions of the service running concurrently, at least for a short period of time. Similarly, anyone calling into the service can get 2 different versions of the UI, or something equally confusing.

Some mitigation options

How a service behaves when 2 versions are running together is a complex topic worthy of a separate post, but I quickly wanted to enumerate the mitigation options available here:

  1. Leader election for middle tier services: Between step 6 of the 1st pass and step 2 of the 2nd pass above, there are 2 versions of the services running which may well be stepping on each other. A reasonable strategy is to use the database for electing a leader. If a service is the leader, it executes the queued up work; any service that is not the leader just idles. If an idle service notices that the leader has not posted a heartbeat in a while, it can nominate itself as the leader. This way, 2 versions of a service don't step on each other
  2. UI discrepancies due to different versions of the front end services can be mitigated if the database is again used to decide which UI should be exposed. The newer front end services show the old UI, because that is what the database setting makes them expose. When the entire deployment completes, a single command changes the setting. The front end services should periodically poll for the DB setting; when they see the change, they start showing the new UI
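Option 1 can be sketched as a lease-style heartbeat row. The `db` dict below stands in for a leader row in the database, and all names are hypothetical; a real implementation would need an atomic compare-and-swap (e.g. a conditional UPDATE) rather than this in-memory check.

```python
# Sketch of DB-based leader election with a heartbeat lease.

LEASE_SECONDS = 30  # hypothetical staleness threshold for the heartbeat

def try_become_leader(db, me, now):
    # Claim leadership if no leader exists or its heartbeat has gone stale.
    leader = db.get("leader")
    if leader is None or now - leader["heartbeat"] > LEASE_SECONDS:
        db["leader"] = {"id": me, "heartbeat": now}
    return db["leader"]["id"] == me

def post_heartbeat(db, me, now):
    # Only the current leader refreshes its own heartbeat.
    if db.get("leader", {}).get("id") == me:
        db["leader"]["heartbeat"] = now

db = {}
v1_leads = try_become_leader(db, "v1-box", now=0)       # v1 claims leadership
v2_idles = try_become_leader(db, "v2-box", now=10)      # lease fresh: v2 idles
v2_takeover = try_become_leader(db, "v2-box", now=100)  # stale: v2 takes over
```

During the patch window, the old-version service simply stops heartbeating when it is shut down, and the new-version service takes over the queued work once the lease expires.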

Strategy 2 - Have 2 versions of service running concurrently

The patch may proceed using the following steps:

  1. Apply a versioned schema on the database. This means entirely new sets of stored procedures and temporary tables are created. The data itself is stored with a version column indicating the service version that last touched it
  2. Install the new services on the front end and middle tier boxes. These services are versioned at every level; a versioned service will not talk to a service from a different version. All incoming requests to the front end services are versioned, and some mechanism exists that allows service version 1 to forward a request to the same service at version 2
  3. Wait to confirm that the new services are all online and ready. Initiate a background task that starts migrating the data to the new version
  4. When a new request shows up on the front end, it asks to talk to front end service version 1. If, however, the corresponding data for it has been marked as version 2 by the background task, the service forwards the request to its version 2 counterpart
  5. The version 2 front end service then handles the request, making more calls into the middle tier and database as necessary. The response is now marked as version 2, and the client must remember that and use version 2 for any subsequent calls
  6. Once the background task has completed, the 2 service versions co-exist, but over a period of time all clients switch to version 2
  7. You may now uninstall the 1st version of the service. Any client that still tries to connect to version 1 will fail and must fall back to querying for the version, discovering version 2 for themselves
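Steps 4-5 above can be sketched as a routing function. The data-version map and the per-version handlers here are simulated with hypothetical names; a real front end would look the version up in the database and forward the request to its counterpart over the network.

```python
# Sketch of version-aware request routing (steps 4-5 above).

def handle_request(key, requested_version, data_versions, services):
    # The background migration marks each record with the version that owns
    # it; unmigrated records stay at the version the client requested.
    actual = data_versions.get(key, requested_version)
    # If the data has moved ahead, this dispatch is the "forward to the
    # version 2 counterpart" step.
    response = services[actual](key)
    # The response carries the serving version; the client must use that
    # version for all subsequent calls.
    return {"version": actual, "body": response}

# Hypothetical per-version handlers and migration state.
services = {1: lambda k: "v1:" + k, 2: lambda k: "v2:" + k}
data_versions = {"order-17": 2}   # already migrated by the background task

migrated = handle_request("order-17", 1, data_versions, services)
fresh = handle_request("order-99", 1, data_versions, services)
```

The design choice worth noting: the data's version, not the client's request, decides which service handles the call, so a record is only ever touched by one version of the code.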

What that means

You need to over-provision hardware so that you can run 2 versions of the service, and you must be extremely deliberate about versioning the code and the data. In return, you get the option of rolling out your new service to a select set of customers and, if you detect any problems, rolling back the change by uninstalling version 2.